This repository has been archived by the owner on May 3, 2022. It is now read-only.

Better reporting for stuck Deployment #256

Merged
merged 2 commits into master from jgreff/deployment-conditions on Jan 22, 2020

Conversation

juliogreff (Contributor) commented:

This builds on top of #242.

One of the bigger challenges for shipper users is to decide if a rollout
is waiting for the right bits to be flipped in the Kubernetes cluster,
or if something went wrong with no hope of being fixed without
intervention. Although this commit does not fundamentally fix that (as
it would be very involved and error prone, requiring the capacity
controller to have intimate knowledge of replica sets and pods, which we
want to avoid), we're now checking for a few more conditions in the
Deployment that surface known situations (sketched in Go after the
list):

  • A Deployment has just been changed, and its Status does not yet
    reflect the brand new Spec. In that case, we consider capacity to be
    in progress.

  • A Deployment times out. This is not super common, as it requires users
    to define a progress deadline (progressDeadlineSeconds) in the
    Deployment itself. If that ever happens, though, we're covered :)

  • A Deployment would violate quotas, or would otherwise cause the
    ReplicaSet to be in a state of error. That's insanely common, and also
    super hard to diagnose without knowing that this is an error condition
    to begin with.
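For illustration, here is a minimal sketch of what those three checks can
look like against the apps/v1 Deployment API. The package name, the
deploymentState type, and the classifyDeployment helper are hypothetical
names for this sketch, not shipper's actual code; the condition types and
the ProgressDeadlineExceeded reason are the standard Kubernetes ones.

    package capacity

    import (
        appsv1 "k8s.io/api/apps/v1"
        corev1 "k8s.io/api/core/v1"
    )

    // deploymentState is a hypothetical summary of where a Deployment is.
    type deploymentState int

    const (
        stateInProgress deploymentState = iota
        stateTimedOut
        stateFailed
        stateReady
    )

    // getCondition returns the Deployment condition of the given type, if any.
    func getCondition(d *appsv1.Deployment, t appsv1.DeploymentConditionType) *appsv1.DeploymentCondition {
        for i := range d.Status.Conditions {
            if d.Status.Conditions[i].Type == t {
                return &d.Status.Conditions[i]
            }
        }
        return nil
    }

    // classifyDeployment mirrors the three checks described above.
    func classifyDeployment(d *appsv1.Deployment) deploymentState {
        // 1. The Spec was just changed and the deployment controller has
        // not observed it yet: capacity is simply still in progress.
        if d.Generation > d.Status.ObservedGeneration {
            return stateInProgress
        }
        // 2. progressDeadlineSeconds was exceeded: the Progressing
        // condition flips to False with reason ProgressDeadlineExceeded.
        if c := getCondition(d, appsv1.DeploymentProgressing); c != nil &&
            c.Status == corev1.ConditionFalse && c.Reason == "ProgressDeadlineExceeded" {
            return stateTimedOut
        }
        // 3. The ReplicaSet cannot create pods (e.g. a quota violation):
        // the deployment controller surfaces this as ReplicaFailure.
        if c := getCondition(d, appsv1.DeploymentReplicaFailure); c != nil &&
            c.Status == corev1.ConditionTrue {
            return stateFailed
        }
        return stateReady
    }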

@juliogreff juliogreff added the enhancement New feature or request label Jan 8, 2020
@juliogreff juliogreff self-assigned this Jan 8, 2020
 	return nil, err
 }

-	patchString := fmt.Sprintf(`{"spec": {"replicas": %d}}`, replicaCount)
+	patch := []byte(fmt.Sprintf(`{"spec": {"replicas": %d}}`, replicaCount))
osdrv (Contributor) commented:

Shall we introduce some nice-looking abstraction maybe?
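One way to read that suggestion is to build the patch through the type
system instead of hand-rolled string formatting. A hedged sketch, where
replicasPatch is a hypothetical helper and not shipper's actual API:

    import "encoding/json"

    // replicasPatch builds the same strategic merge patch as the
    // fmt.Sprintf above; the anonymous struct marshals to
    // {"spec":{"replicas":N}}, with the encoder handling quoting.
    func replicasPatch(replicaCount int32) ([]byte, error) {
        patch := struct {
            Spec struct {
                Replicas int32 `json:"replicas"`
            } `json:"spec"`
        }{}
        patch.Spec.Replicas = replicaCount
        return json.Marshal(patch)
    }

The call site would then become patch, err := replicasPatch(replicaCount),
which keeps the JSON shape in one place if more fields ever need patching.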

osdrv previously approved these changes Jan 20, 2020
Most of these checks have been moved into the controllers themselves, and
while that's not necessarily ideal, it's much better than a random "utils"
package.
@hihilla hihilla merged commit 7fa36c2 into master Jan 22, 2020
@hihilla hihilla deleted the jgreff/deployment-conditions branch January 22, 2020 14:20